A. Introduction

1. Interpretation of GIS-referenced in-situ radiation measurements

The interpretation of GIS-referenced in-situ radiation measurements is carried out through five “typical” steps:

  1. data loading;
  2. data cleaning and pre-processing (including Geo-processing);
  3. Exploratory Data Analysis (EDA - visualize, summary statistics, getting acquainted with the dataset involved);
  4. spatial estimation (exceedance probability, …);
  5. results dissemination.

Each of these steps will be detailed and examplified in the following chapters.

2. R free software environment

R free software environment for statistical computing and graphics has been used by a large community of Statisticians and data analysts for decades. R’s killer feature is its extremely rich ecosystem of library/packages allowing to perform (among others): Statistical analysis, modelling, data cleaning, data transformation & visualization, machine learning, deep learning, GIS, Geospatial Analysis, …

Having to learn and use R programming language may look daunting for people used to “point & click” user interfaces but we do believe that the learning experience might be much pleasant and approachable than you first imagined and that it will pay off in the long term.

Here below some motivations:

  • using R to script (a sequence of data analysis steps - workflow) your analysis pipeline is definitely much easier than developping a full software or complex libary;
  • R ecosystem includes packages like the tydiverse suite and in particular dplyr, sf, … allowing to implement your “ideal” data analysis pipeline in an extremely intuitive and seamless manner;
  • RStudio IDE (Integrated Development Environment) offers an ergonomic interface where you can seamlessely code, experiment, debug your data analysis, access packages documentation, experiment different scenarios, visualize your analysis outputs, disseminate your results, …
  • there is a huge amount of extremely valuable and free learning resources online;
  • you will be able to implement reproducible research http://reproducibleresearch.net/ workflows: access, pre-process, analyse and visualize your data, document and publish your outcomes in a reproducible way.

In our particular use case, we will tap into R packages particularly relevant to the tasks at hand. It includes:

3. Google API key

Prior to July 2018, it was possible to use google static maps without signing up for the service. However, at this time, the service now requires individual sign up and also billing enablement. The billing enablement especially is a bit of a downer, but you can use the free tier without incurring charges.

The following are the steps you can follow to enable this service. It is imperative that you follow these steps properly:

3.1 Project creation and generation of API Key in Google

  • Visit https://console.cloud.google.com and sign up for a google cloud platform trial.

  • Create a project. In the top nav, you can either select an existing project or create a new one.

Create Project

Create Project

  • Add the “Maps Static API” service to the project. Navigate to the library of API services and search for “Maps Static API”. Enable the service. Generate API

  • Generate an API Key. Navigate to the credentials area and select “Create credentials”. Take note of your API key (from the ‘Credentials’ menu, copy your API key to the clipboard). You will need to register it with the package later.

Static API

Static API

  • Next add the “Geocoding API” service to the project and enable the service.

3.2 Enable billing

3.3 Update of the ggmap R package

4. How to best use this “template” data analysis workflow

This tutorial targets two distinct types of audience with different objectives:

  1. researchers, engineers, … willing to develop thorough data analysis skills and use them on a regular basis: execute, modify code snippets provided and explore/study further resources and supplemental material pointed out;

  2. decision makers willing to understand typical data analysis workflows and to assess toolbox’s suitability and potential without delving into the programming details: understand the rationale and sequence of a typical workflow; [optionnally] tweack code snippets, model parameters, visualizations, …

5. RStudio and R markdown crash course

RStudio is a free and open-source Integrated Development Environment (IDE) for R. In a nutshell, it is a single entry point to a wide range of features and functionalities allowing to perform data loading, transformation, analysis, modelling and visualization. Additionnaly, used in combination with RMarkdown package, it allows to document and publish your data analysis pipeline/workflow efficiently, thus ensuring reproducibility and fostering collaboration.

RStudio IDE, RMarkdown and R packages in general are extremely well documented. In particular, it is highly recommended to get access to the series of RStudio Cheat Sheets available here: https://www.rstudio.com/resources/cheatsheets/ (you will find them as well in pdf format in the /Resources folder of the course material). For now the two most important one are: RStudio IDE Cheat Sheet, R Markdown Cheat Sheet.

☞ For further reference, take a look as well at: http://r4ds.had.co.nz/r-markdown.html#r-markdown-basics

5.1 Inserting and executing code “chunks” in RMarkdown

An RMarkdown document is a mix of formatted text, images, links, graphs and code chunks (small piece of R code).

R code chunks are inserted surrounded by a pair of triple quote as follows (look at the next chapter to create it in a single keyboard stroke):

A typical R code chunk

A typical R code chunk

Then the whole RMarkdown document can be “Knitted” to html or whatever format you find relevant through RStudio User Interface (or associated keyboard shortcut - see next chapter).

Knit to … menu in RStudio

Knit to … menu in RStudio

It is important to note that each chunk can be executed individually and sequentially by clicking the green left arrow at the top right of a code chunk. This way, before generating/knitting the whole RMarkdown document, you can experiment/debug it.

It is important to remember than code chunks must be executed from top to bottom sequentially as later chunks will depend on calculations of previous ones.

Rstudio “Run” menu accessible at the top of a RMarkdown document is as well extremely convenient as it allows to execute all chunks above, below the current selected one

Run All Chunks above …

Run All Chunks above …

☞ Pro tip: You can as well select any line in a chunk and execute it by pressing “Shift + Enter” keyboard keys.

☞ Pro tip: Last, you can experiment any piece of code in the R console itself. This is an ideal way to interactively analyse, visualize your data, debug your code, …

Exploring through R console

Exploring through R console

5.2 Some useful keyboard shortcuts

There are dozens of keyboard shorcuts that would improve your user experience and productivity. Many of them are listed in RStudio “Help” menu. Below a curated list of extremely useful ones:

Note: “Cmd/Ctrl” means you should use “Cmd” key for a Mac keyboard and“Ctrl” for others

  • Cmd/Ctrl-Alt-I: Create code chunk

  • Cmd/Ctrl-Shift-K: “Knit”" all document

  • Cmd/Ctrl-Shift-Enter: Executing single code chunk (you can use as well the green top-left arrow)

  • alt + -: <- (R variable assignment symbol)

  • Cmd/Ctrl-Shift-M: %>% (Piping commands - see below)

  • Cmd + alt + R: Run all

  • Cmd/Ctrl-Shift-C: Comment block selected/highlighted

B. Exploratory Data Analysis

As nicely summarized by H. Wickham and G. Grolemund in their outstanding book “R for Data Science” (see reference below): Exploratory Data Analysis is an iterative cycle where you:

  1. Generate questions about your data;
  2. Search for answers by visualizing, transforming and modelling your data;
  3. Use what you learn to refine your questions and/or generate new questions.

☞ For further reference: http://r4ds.had.co.nz/exploratory-data-analysis.html (Accessible online)

In our specific case, we will load the dataset collected, clean it, look at summary statistics, transform it, look at values distribution, spot potential outliers, spatialize it, … Our ultimate goal being to gain enough knowledge to inform and prepare the spatial estimation of doses in our area of interest.

1. Data loading & transformation

Data can come in a variety of formats and structures: .csv, .xlsx and many others. In our particular case, we will read a specific sheet of an Excel file using the readxl package.

  • Let’s first read the data currently stored in two Atomtex Detector waypoint files:

☞ For further reference: http://r4ds.had.co.nz/transform.html and http://r4ds.had.co.nz/pipes.html

☞ Pro tip: From RStudio, the documentation of any function (usage, function parameters, returned value, …) can be accessed by writing in RStudio console ?funcion_name, in our case: ?read_excel.

☞ For further reference: http://r4ds.had.co.nz/data-import.html

It’s always a good idea to take a quick look at the first and last rows of a just loaded dataset.

  • Quick look at first rows:
##   dose      lat      lon source
## 1   47 47.97523 16.50849  wp_01
## 2   47 47.97523 16.50848  wp_01
## 3   47 47.97523 16.50848  wp_01
## 4   47 47.97523 16.50847  wp_01
## 5   47 47.97523 16.50847  wp_01
## 6   47 47.97523 16.50847  wp_01
  • Quick look at last rows:
##     dose      lat      lon source
## 550   51 47.97495 16.51116  wp_02
## 551   51 47.97499 16.51115  wp_02
## 552   51 47.97503 16.51118  wp_02
## 553   51 47.97505 16.51120  wp_02
## 554   51 47.97507 16.51123  wp_02
## 555   51 47.97516 16.51102  wp_02

Note: By default the first and last 5 rows are displayed but it can be further specified, for instance: head(data, 10).

2. First round of EDA

  • Getting Summary Statistics:
##       dose             lat             lon           source         
##  Min.   : 31.00   Min.   :47.97   Min.   :16.51   Length:555        
##  1st Qu.: 47.00   1st Qu.:47.97   1st Qu.:16.51   Class :character  
##  Median : 51.00   Median :47.98   Median :16.51   Mode  :character  
##  Mean   : 55.51   Mean   :47.98   Mean   :16.51                     
##  3rd Qu.: 57.00   3rd Qu.:47.98   3rd Qu.:16.51                     
##  Max.   :481.00   Max.   :47.98   Max.   :16.51
  • How many points are collected per source (name of data collector)?
## # A tibble: 2 x 2
##   source     n
##   <chr>  <int>
## 1 wp_01    243
## 2 wp_02    312

Important remark: To avoid storing all intermediate steps into “dummy” variables and code your data analysis more “fluidly”, you can perform as many operations as you want in one go using the %>% (pronounce “pipe”) operator as follows (see curated RStudio shortcuts in the introduction).

  • Histogram of dose per source :

R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.

Again, data visualization is a wide topic and ggplot offers a lot of flexibility and expressive power. Hopefully, there are great resource materials available.

☞ For further reference: * http://r4ds.had.co.nz/data-visualisation.html * https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

  • To quickly “spatialize” all data:

  • It looks like one of the source (waypoint file) has doses on the way to the site, which one?

  • Quick interactive map with various base layers:
  • [Optional] Further details on subsampling:

Whenever a piece of code looks cryptic, decompose it and execute piece by piece in the RStudio console. Below the subsampling line used above decomposed:

3. Refining the area of interest

It appears that one of the data collector collected data on the way to the site. Our ultimate goal being to produce a map of doses in “brd” site, we would like to exclude these points. There are various way to do so:

  1. using available lon and lat attributes/fields/columns and filtering measurements outside our range of interest;
  2. using GIS-like techniques to clip/crop collected points in a polygon of interest (in our case what’s call a “Convex Hull”).

3.1 Subsetting using lon and lat attributes

Let’s try to identify site’s perimeter using wp_01 data and subset wp_02’s one.

  • Excluding wp_02 measurements:
  • Subsetting the whole dataset by lon and lat range
  • Let’s “spatialize” again all subsetted data as a quick check:

  • Spatialize with satellite basemap:

The ggmap package allows to download various type of basemaps (including sattelite imagery) and to use it in combination with ggplot:

We first set the API key (see guide presented in the Introduction section)

Start Making Maps!

  • Let’s assess the density of measurements taken [Advanced]

  • Where are measurements over 150 nSv/h (location of sources)?

Let’s now visually assess where are points over 150 nSv/h as it might reveal where the gamma sources are located.

3.2 Subsetting using GIS-like functions

  • Using the sf package to enter GIS world!

Note: 4326 is the code for the World Geodesic System (WGS) Coordinate Reference System (CRS) most often used when using GPS-enabled device.

☞ For further reference on Coordinated Reference Systems and Map projections: * projections.Rmd * http://www.progonos.com/furuti/MapProj/CartIndex/cartIndex.html

☞ For further reference on sf package: https://r-spatial.github.io/sf/articles/sf1.html

Now, we have converted our initial data (R dataframe) to sp dedicated type:

## Simple feature collection with 243 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 16.50778 ymin: 47.97438 xmax: 16.50927 ymax: 47.97545
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
##    dose source                  geometry
## 1    47  wp_01 POINT (16.50849 47.97523)
## 2    47  wp_01 POINT (16.50848 47.97523)
## 3    47  wp_01 POINT (16.50848 47.97523)
## 4    47  wp_01 POINT (16.50847 47.97523)
## 5    47  wp_01 POINT (16.50847 47.97523)
## 6    47  wp_01 POINT (16.50847 47.97523)
## 7    47  wp_01 POINT (16.50847 47.97523)
## 8    47  wp_01 POINT (16.50847 47.97523)
## 9    47  wp_01 POINT (16.50847 47.97523)
## 10   47  wp_01 POINT (16.50847 47.97523)

As you see, a bounding box is calculated around the points, we have a projection and a special geometry column. We are now ready to perform GIS-like operations.

  • Let’s calculate the convex hull/envelope (polygon surrounding all of our points of interest):

  • Compute area of Region of Interest:
## 7800.48 [m^2]
  • Compute perimeter of Region of Interest:
## 335.4038 [m]
  • And now subset all data using only measurements within the convex hull:
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries

B. Spatial estimation & Geostatistics

So far we have used two extremely useful R packages dplyr and sf when dealing with data analysis in general and geo-data in particular.

In this section we want to estimate dose rates at unobserved locations. There are non-geostatistical and geostatistical methodologies to do so. The main difference is that in the latter scenario, we want to take advantage of the spatial correlation. As Tobler’s law states “everyhing is related to everything else, but near things are more related than distinct things” https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography.

Geostatistics provides a set of tools and methodologies allowing to take advantage of this spatial correlation (when present) to provide better estimates.

There are two additional package tremendously useful when you want to perform Geostatistics, named sp and gstat.

  • Let’s convert first our dataset to UTM 39N map projection:

So far, we have used the longitude and latitude to locate our measurements, however in this part we will use a lot distances between points and while it is possible to measure distances in degrees it would be more “meaningful” to use meters instead. To do so we need to project our data into UTM map projection for our region of interest (see again the short briefing on CRS and map projections for further details).

  • Taking a look at max X and Y distances:
## [1] "Delta X max (m)"
## [1] 112.0914
## [1] "Delta Y max (m)"
## [1] 119.0308
  • Adding a column of distances from highest dose location [optional]:

As stated above, the gstat package we will use to perform Geostatistics requires sp compatible objects.

  • We first need to make our data “sp-friendly” as follows:
##   dose source      distance
## 1   47  wp_01 10.402626 [m]
## 2   47  wp_01 10.162041 [m]
## 3   47  wp_01  9.722122 [m]
## 4   47  wp_01  9.475296 [m]
## 5   47  wp_01  9.428866 [m]
## 6   47  wp_01  9.691418 [m]
  • Let’s retake a quick look at dose histogram:

1. Variography

  • Renaming coordinates names (for further use)
## [1] "coords.x1" "coords.x2"

Kriging is not possible when the covariance matrix is singular (not invertible). This is the case when there are duplications locations. Hence, it is always safer to remove them.

  • Remove duplicated locations:
##      [,1] [,2]
## [1,]    6    7
## [2,]   66   67
## [3,]  249  250
## [4,]  327  328
## [5,]  406  407

Hence to remove them:

  • Assessing spatial correlation:

  • Plotting variogram:

The variogram is a visualization tool allowing to assess and model spatial correlation. The whole field of geostatistics relies on a set of strict assumptions such as: * Gaussion distribution of your data; * Stationarity: the mean, variance and variogram should remain the same whatever the sub-region of your region of interest.

This last assumption is a strong one leading to many avatars of geostatistics-based interpolation techniques such as: ordinary Kriging, co-Kriging, … In most cases the main goal is to “detrend” your data (having a constant mean over your area of interest - whatever the subregion). To do so, you can use a linear model using distance or simply x and y coordinates as predictors or more recent non-linear machine learning techniques.

☞ For further reference:

  • The “summarized/aggregated” version:

☞ For further reference on R’s formula: http://r4ds.had.co.nz/model-basics.html#formulas-and-model-families

  • Fitting variogram:
Fitted variogram parameters

Fitted variogram parameters

##   model      psill    range
## 1   Nug   87.13567  0.00000
## 2   Sph 1109.08437 14.79219

Different models are available:

##    short                                      long
## 1    Nug                              Nug (nugget)
## 2    Exp                         Exp (exponential)
## 3    Sph                           Sph (spherical)
## 4    Gau                            Gau (gaussian)
## 5    Exc        Exclass (Exponential class/stable)
## 6    Mat                              Mat (Matern)
## 7    Ste Mat (Matern, M. Stein's parameterization)
## 8    Cir                            Cir (circular)
## 9    Lin                              Lin (linear)
## 10   Bes                              Bes (bessel)
## 11   Pen                      Pen (pentaspherical)
## 12   Per                            Per (periodic)
## 13   Wav                                Wav (wave)
## 14   Hol                                Hol (hole)
## 15   Log                         Log (logarithmic)
## 16   Pow                               Pow (power)
## 17   Spl                              Spl (spline)
## 18   Leg                            Leg (Legendre)
## 19   Err                   Err (Measurement error)
## 20   Int                           Int (Intercept)

If you want to have an idea of their “shape”, you can simply run:

2. Krigging

3. Handling colour palettes

All cartographers, soon or later in their careers, will realize the singular role played by colors in map design. Arthur H. Robinson, in 1952 brillantly captured this thought:

For some reason, as soon as color is applied to a map the color becomes a focus of criticism, and everyone, regardless of his familiarity with the principles of color use, seems to feel that he is entitled to comment upon the color on the basis of his own likes and dislikes

Principles of color uses for Thematic Mapping have been well studied, theorized and applied for a long time. So it is better to rely on best practices. ColorBrewer: http://colorbrewer2.org/ provides a set of colour palettes (quantitative and qualitative) complying with the highest mapping standards. Fortunately, there is an R package “RColorBrewer” allowing to take advantage of it.

2.4 Controlling colour breaks

Colour encoding forces us to discretize our data (here air dose). This is a clear loss of information and definitely not a simple or random choice and excellent way to “Lie with maps” https://press.uchicago.edu/ucp/books/book/chicago/H/bo27400568.html

Let’s take a closer look at this critical step:

  • Automatic computing of breaks:

☞ For further reference on “sapply” like functions: https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html

  • Notes: “npc” stants for Normalised Parent Coordinates

  • Assessing breaks relevance:

Let’s first create a utility function that plot an histogram of values with breaks overlayed:

4. Gridded prediction using Inverse Distance Weighting (IDW)

For reference, Inverse Distance Weighting is one of the most straightforward interpolation method, estimating unobserved location values as a weighted averages of surrounded observed ones based on their distances. This can be achieve very easily in R:

## [inverse distance weighted interpolation]
  • Mapping inteporlated values:

  • Interactive map with satellite base map: